3 May 2021 (updated: 22 April 2021)

i. Week 15 of 16 Focus

  • Students receive and act on Ch.19 feedback.
  • Students apply computational methods to real-world data (Final Exam/Poster). The final exam is due Friday, May 7, 2021. I am glad to look at drafts ahead of time and give feedback prior to grading.
  • Discuss a modern, simulation-based workflow for Bayesian time series modeling.

Today’s plan

  • More final exam discussion and guidelines
  • Discuss a modern, simulation-based workflow for Bayesian time series modeling.

Please contribute questions and comments

1 – Introduction

Today we’ll fit Bayesian time series models

Student learning outcomes

I hope that you’ll be able to…

  • Formally describe Bayesian time series models.
  • Understand how/why to fit Bayesian models using stan and R.
  • Understand how/why to use simulation to specify priors.
  • Understand how/why to simulate from posterior predictive distributions.
  • Compare models using predictive performance.

Software used

  • Mostly the excellent rethinking R package [@McElreath2020] that accompanies the textbook Statistical Rethinking.
  • rethinking is a very convenient suite of functions for Bayesian inference that builds on, and works well with, the Hamiltonian Monte Carlo based software stan.
  • The way you specify a model in rethinking is essentially the same way you would write it down in prose.

2 – Background

Bayes rule

  • Bayes’ rule is a powerful restatement of conditional probability:

\[ P(\theta | data) = \frac{P(data | \theta) P(\theta)}{\int_\Omega P(data | \theta) P(\theta) d\theta} \]

  • Parameters are not fixed for Bayesians. We are uncertain of their values, and so we (must) regard probability as the way to reason about them [@Jaynes2003; @Lindley2000].
  • We learn through Bayes’ rule: combining prior knowledge with the data being analyzed yields our updated belief,
  • “the posterior is proportional to the prior times the likelihood”:

\[\pi(\theta \vert data) \propto \pi(\theta) \times \pi(data \vert \theta)\]
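The update can be carried out numerically on a grid, which makes the role of each term explicit. Below is a minimal pure-Python sketch (not the course's R code) for a coin-flip example with hypothetical data, 6 heads in 9 tosses, and a flat prior:

```python
import math

# Grid approximation of Bayes' rule (hypothetical coin-flip data).
n_tosses, heads = 9, 6
grid = [i / 1000 for i in range(1001)]                  # candidate theta values
prior = [1.0] * len(grid)                               # flat prior
like = [t**heads * (1 - t)**(n_tosses - heads) for t in grid]
unnorm = [p * l for p, l in zip(prior, like)]           # prior x likelihood
norm = sum(unnorm)                                      # plays the role of the integral
posterior = [u / norm for u in unnorm]                  # proper distribution over the grid

post_mean = sum(t * p for t, p in zip(grid, posterior))
```

With a flat prior this recovers the Beta(7, 4) posterior, whose mean is 7/11; the grid mean lands essentially on top of it.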

Bayesian regression

  • Regression amounts to defining a surface plus noise.
  • By specifying a conditional mean (conditional on the fixed covariate values), we can build a regression model.
  • Below is one way to describe a simple linear model with constant error:

\[ \begin{split} y_i &\sim \mathcal{N}( \alpha + \beta x_i , \sigma ) \\ \alpha &\sim p(\alpha) \\ \beta &\sim p(\beta) \\ \sigma &\sim p(\sigma) \\ \end{split} \]

Sampling the posterior

\[ \pi(\theta | data) = \frac{\pi(data | \theta) \pi(\theta)}{\int_\Omega \pi(data | \theta) \pi(\theta) d\theta} \]

  • The denominator is a constant scale factor that ensures \(\pi(\theta|data)\) is a proper probability distribution
    • We often say that the posterior is proportional to the prior times the likelihood: \(\quad \pi(\theta | data) \propto \pi(\theta) \times \pi(data | \theta)\)
  • Often the integral in the denominator is complex or of a high dimension
  • A solution is to use Markov Chain Monte Carlo (MCMC) simulations to draw samples from the posterior distribution
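A toy sampler shows the MCMC idea in a few lines. This is a pure-Python random-walk Metropolis sketch for a normal mean with known sd and a hypothetical N(0, 10) prior; it is not the sampler stan uses (stan uses Hamiltonian Monte Carlo), but the target-and-accept logic is the same:

```python
import math
import random

# Random-walk Metropolis for a normal mean, known sd = 1, prior N(0, 10).
# Data and parameter values are hypothetical, not course data.
random.seed(42)
y = [random.gauss(2.0, 1.0) for _ in range(30)]

def log_post(theta):
    ll = sum(-0.5 * (yi - theta) ** 2 for yi in y)      # log-likelihood (up to a constant)
    lp = -0.5 * (theta / 10) ** 2                       # log-prior (up to a constant)
    return ll + lp

theta, samples = 0.0, []
for _ in range(20000):
    prop = theta + random.gauss(0, 0.5)                 # symmetric proposal
    if math.log(random.random()) < log_post(prop) - log_post(theta):
        theta = prop                                    # accept; otherwise keep current theta
    samples.append(theta)

post_mean = sum(samples[2000:]) / len(samples[2000:])   # discard burn-in
```

With this weak prior the posterior mean nearly coincides with the sample mean, which the chain recovers without ever evaluating the normalizing integral.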

Prediction for model comparison

  • The problem with parameters: adding parameters improves fit to the sample but invites overfitting.
  • Instead, look for the regular features by estimating out-of-sample predictive performance.
  • It is straightforward to make predictions given a posterior by computing the (log) posterior predictive distribution using the \(S\) Monte Carlo samples:

\[ \mathrm{lppd}(y, \Theta) = \sum_i \log \frac{1}{S} \sum_s p(y_i | \Theta_s) \]

  • Bayesian leave-one-out cross validation (loo-cv):

\[ \mathrm{lppd}_{CV} = \sum_i \frac{1}{S} \sum_s \log p(y_{i} | \theta_{-i,s}) \]
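The lppd sum can be computed directly from posterior samples. Below is a pure-Python sketch with hypothetical data and samples, assuming a normal likelihood with sd fixed at 1; the loo-cv version additionally requires refitting (or importance weighting) for each left-out point:

```python
import math

# lppd from S posterior samples (hypothetical toy numbers).
y = [1.8, 2.2, 1.5, 2.9, 2.1]                            # "data"
theta_samples = [1.9, 2.0, 2.1, 2.3, 1.7, 2.05, 2.2, 1.95]  # S = 8 posterior draws
S = len(theta_samples)

def log_norm_pdf(x, mu, sd=1.0):
    return -0.5 * math.log(2 * math.pi * sd**2) - 0.5 * ((x - mu) / sd) ** 2

lppd = 0.0
for yi in y:
    logs = [log_norm_pdf(yi, t) for t in theta_samples]
    m = max(logs)                                        # log-sum-exp for numerical stability
    lppd += m + math.log(sum(math.exp(l - m) for l in logs) / S)
```

Note the average is over densities, not log-densities, which is why the log-sum-exp trick matters in real models.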

3 – cherry_blossoms data set

A thousand years of cherry blossom data

People in Japan have been recording cherry blossom timings and climate data for over 1000 years; our data set below spans 801 to 2015.

library(rethinking)
## Loading required package: rstan
## Loading required package: StanHeaders
## rstan (Version 2.21.2, GitRev: 2e1f913d3ca3)
## For execution on a local, multicore CPU with excess RAM we recommend calling
## options(mc.cores = parallel::detectCores()).
## To avoid recompilation of unchanged Stan programs, we recommend calling
## rstan_options(auto_write = TRUE)
## 
## Attaching package: 'rstan'
## The following object is masked from 'package:tidyr':
## 
##     extract
## Loading required package: parallel
## rethinking (Version 2.11)
## 
## Attaching package: 'rethinking'
## The following object is masked from 'package:purrr':
## 
##     map
## The following object is masked from 'package:stats':
## 
##     rstudent
data(cherry_blossoms)
d <- cherry_blossoms
precis(d)
##                   mean          sd      5.5%      94.5%       histogram
## year       1408.000000 350.8845964 867.77000 1948.23000   ▇▇▇▇▇▇▇▇▇▇▇▇▁
## doy         104.540508   6.4070362  94.43000  115.00000        ▁▂▅▇▇▃▁▁
## temp          6.141886   0.6636479   5.15000    7.29470        ▁▃▅▇▃▂▁▁
## temp_upper    7.185151   0.9929206   5.89765    8.90235 ▁▂▅▇▇▅▂▂▁▁▁▁▁▁▁
## temp_lower    5.098941   0.8503496   3.78765    6.37000 ▁▁▁▁▁▁▁▃▅▇▃▂▁▁▁
d2 <- d[ complete.cases(d$doy) , ] # complete cases on doy
n <- nrow(d2)
d2$doy1 <- c( NA,  d2$doy[-n] ) # add a lag
d2 <- d2[ -1, ] ## for model comparison later

doy: first day of blooms

4 – Modeling the first day of blooms

Model 0 SLR

\[ \begin{split} y_i &\sim \mathcal{N}( \alpha + \beta \, year_i , \sigma ) \\ \alpha &\sim N(100, 10) \\ \beta &\sim N(0, 0.1) \\ \sigma &\sim Exponential(1) \\ \end{split} \]

m0 <- ulam(
    alist(
        y ~ dnorm( mu, sigma ) ,
        mu <- a + b*year ,
        a ~ dnorm(100, 10) ,
        b ~ dnorm(0, 0.1),
        sigma ~ dexp(1)
    ) , data = list( y=d2$doy, year=d2$year),
    log_lik = T)

Prior predictive checks

## prior predictive checks
prior <- extract.prior( m0 )
plot( NULL, xlim = range(d2$year),
     ylim= c(50, 150), xlab="Year", ylab="Lines Sampled from the Prior Predictive" )
N <- 100
for (i in 1:N ) {
    curve( prior$a[[i]] + prior$b[[i]]*x ,
          from = min(d2$year), to = max(d2$year), add = T ,
          col = col.alpha( "black", 0.2 ) )
}

Posterior mean prediction

## precis( m0 )
post <- extract.samples( m0 )
mu <- link( m0 )
mu_PI <- apply( mu, 2, PI, 0.89 )
d3 <- data.frame( m0_mean = colMeans(mu), m0_lower = mu_PI[ 1, ], m0_upper = mu_PI[ 2, ] ) 
d3 <- cbind( d2, d3 )

Model 0 SLR mean estimation

Model 0 Posterior predictions

mu <- link( m0 )
mu_PI <- apply( mu, 2, PI, 0.89 )
D_sim <- sim( m0, n = 500 )
D_PI <- apply( D_sim, 2, PI, 0.89 )

Model 0 Posterior predictions

Model 1 AR(1)

\[ \begin{split} y_i &\sim \mathcal{N}( \alpha + \beta \, y_{i-1}, \sigma ) \\ \alpha &\sim N(100, 10) \\ \beta &\sim N(0, 10) \\ \sigma &\sim Exponential(1) \\ \end{split} \]

m1 <- ulam(
    alist(
        y ~ dnorm( mu, sigma ) ,
        mu <- a + b*doy1,
        a ~ dnorm(100, 10) ,
        b ~ dnorm(0, 10),
        sigma ~ dexp(1)
    ) , data = list( y=d2$doy, doy1=d2$doy1),
    log_lik = T)
## Warning: Bulk Effective Samples Size (ESS) is too low, indicating posterior means and medians may be unreliable.
## Running the chains for more iterations may help. See
## http://mc-stan.org/misc/warnings.html#bulk-ess
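To see what the AR(1) mean structure implies, it helps to simulate from it. Below is a pure-Python sketch with hypothetical parameter values on the scale of this model (a = 91, b = 0.13, sigma = 6.3); with |b| < 1 the series is stationary around a / (1 - b):

```python
import random

# Simulate an AR(1) series y_t = a + b*y_{t-1} + noise.
# Parameter values are hypothetical, chosen to be on this model's scale.
random.seed(7)
a, b, sigma = 91.0, 0.13, 6.3
y = [a / (1 - b)]                         # start at the stationary mean
for _ in range(5000):
    y.append(a + b * y[-1] + random.gauss(0, sigma))

mean_y = sum(y) / len(y)                  # should sit near a / (1 - b)
```

This also explains why a N(0, 10) prior on b is too loose: draws with |b| > 1 imply explosive, non-stationary series.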

Model 1 AR(1) prior predictives

## prior predictive checks
prior <- extract.prior( m1 )
N <- 100
ySim <- sapply( 1:N, function(k) ( prior$a[k] + prior$b[k]*d2$doy1 ) )
plot( NULL, xlim = range(d2$year),
     ylim= range(ySim), xlab="Year", ylab="Lines Sampled from the Prior Predictive" )
for (i in 1:N ) {
    lines(x = d2$year, y = ySim[, i],
          col = col.alpha( "black", 0.2 ) )
}

Model 1 AR(1) prior predictives

Try a more informative prior:

m11 <- ulam(
    alist(
        y ~ dnorm( mu, sigma ) ,
        mu <- a + b*doy1,
        a ~ dnorm(100, 5) ,
        b ~ dnorm(0, 0.5),
        sigma ~ dexp(1)
    ) , data = list( y=d2$doy, doy1=d2$doy1),
    log_lik = T)
## Warning: Bulk Effective Samples Size (ESS) is too low, indicating posterior means and medians may be unreliable.
## Running the chains for more iterations may help. See
## http://mc-stan.org/misc/warnings.html#bulk-ess
## Warning: Tail Effective Samples Size (ESS) is too low, indicating posterior variances and tail quantiles may be unreliable.
## Running the chains for more iterations may help. See
## http://mc-stan.org/misc/warnings.html#tail-ess

Model 1 AR(1) prior predictives

## prior predictive checks
prior <- extract.prior( m11 )
## visualize
N <- 100
ySim <- sapply( 1:N, function(k) ( prior$a[k] + prior$b[k]*d2$doy1 ) )
plot( NULL, xlim = range(d2$year),
     ylim= range(ySim), xlab="Year", ylab="Lines Sampled from the Prior Predictive" )
for (i in 1:N ) {
    lines(x = d2$year, y = ySim[, i],
          col = col.alpha( "black", 0.2 ) )
}

Model 1 AR(1) posterior mean

precis( m11 )
##            mean        sd       5.5%      94.5%     n_eff     Rhat4
## a     91.2432696 3.2036524 86.3690575 96.8803828  90.18661 1.0031202
## b      0.1274956 0.0305252  0.0739177  0.1739626  90.06525 1.0029655
## sigma  6.2913129 0.1436954  6.0743488  6.5353897 218.27668 0.9981938
post <- extract.samples( m11 )
mu <- link( m11 )
mu_PI <- apply( mu, 2, PI, 0.89 )
d4 <- data.frame( m1_mean = colMeans(mu), m1_lower = mu_PI[ 1, ], m1_upper = mu_PI[ 2, ] )
d4 <- cbind( d3, d4 )

Model 1 AR(1) posterior mean

Model 1 AR(1) posterior predictions

mu <- link( m11 )
mu_PI <- apply( mu, 2, PI, 0.89 )
D_sim <- sim( m11, n = 500 )
D_PI <- apply( D_sim, 2, PI, 0.89 )

Model 1 AR(1) posterior predictions

Model 2 Splines

The likelihood:

\[ \begin{split} D_i &\sim \mathcal{N}( \mu_i , \sigma ) \\ \mu_i &= \alpha + \sum_{k=1}^K w_k B_{k,i}\\ \end{split} \]

With priors:

\[ \begin{split} \alpha &\sim N(100, 10) \\ w_j &\sim N(0, 10) \\ \sigma &\sim Exponential(1) \\ \end{split} \]

Preprocessing for Model 2 Spline

num_knots <- 15
knot_list <- quantile( d4[ , "year"], probs = seq( 0, 1, length.out = num_knots ) )
library(splines)
B <- bs( d4[ , "year"],
        knots = knot_list[ -c(1, num_knots) ],
        degree = 3, intercept = TRUE )
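bs() evaluates these basis functions via the Cox-de Boor recursion. The pure-Python sketch below uses a made-up clamped knot sequence on [0, 1] rather than the year knots above:

```python
# Cox-de Boor recursion behind a B-spline basis like the one bs() builds.
# Knot sequence here is hypothetical (clamped cubic basis on [0, 1]).
def basis(i, k, t, x):
    """Value of basis function B_{i,k} at x for knot vector t."""
    if k == 0:
        return 1.0 if t[i] <= x < t[i + 1] else 0.0
    out = 0.0
    if t[i + k] > t[i]:                  # skip zero-width spans (0/0 := 0)
        out += (x - t[i]) / (t[i + k] - t[i]) * basis(i, k - 1, t, x)
    if t[i + k + 1] > t[i + 1]:
        out += (t[i + k + 1] - x) / (t[i + k + 1] - t[i + 1]) * basis(i + 1, k - 1, t, x)
    return out

degree = 3
interior = [0.25, 0.5, 0.75]
t = [0.0] * (degree + 1) + interior + [1.0] * (degree + 1)  # clamped knots
n_basis = len(t) - degree - 1                               # 7 basis functions here
row = [basis(i, degree, t, 0.37) for i in range(n_basis)]   # one row of the B matrix
```

At any point in the interior the basis values are nonnegative and sum to one, which is why each row of B acts as a local weighted average of the w parameters.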

Preprocessing for Model 2 Spline

plot ( NULL, xlim=range(d4[ , "year"]), ylim = c(0,1), xlab="year", ylab="basis")
for (i in 1:ncol(B) ) lines( d4[ , "year"], B[ ,i ] )

Model 2 splines fitting

m2 <- quap(
    alist(
        D ~ dnorm( mu, sigma ) ,
        mu <- a + B %*% w,
        a ~ dnorm(100, 10) ,
        w ~ dnorm(0, 10),
        sigma ~ dexp(1)
    ) , data = list( D=d4$doy, B = B ) ,
    start = list( w = rep( 0, ncol(B) ) ) )

Model 2 splines fitting

precis( m2 )
## 17 vector or matrix parameters hidden. Use depth=2 to show them.
##             mean       sd      5.5%      94.5%
## a     103.486754 2.370453 99.698312 107.275196
## sigma   5.870704 0.143698  5.641047   6.100362
post <- extract.samples( m2 )
mu <- link( m2 )
mu_PI <- apply( mu, 2, PI, 0.89 )
d5 <- data.frame( m2_mean = colMeans(mu), m2_lower = mu_PI[ 1, ], m2_upper = mu_PI[ 2, ] )
d5 <- cbind( d4, d5 )

Model 2 splines posterior mean

Posterior predictions simulations from M2 Spline

  • Simulated data sets
  • These show observation uncertainty, as opposed to conditional mean uncertainty.

mu <- link( m2 )
mu_PI <- apply( mu, 2, PI, 0.89 )
D_sim <- sim( m2, n = 1e4 )
D_PI <- apply( D_sim, 2, PI, 0.89 )

Posterior predictions from M2 Spline

5 – Model comparison through predictive performance

Using the PSIS estimate of loo

Rather than fitting n models to estimate the loo-cv error, you can use a Pareto-smoothed importance sampling (PSIS) estimate [@Vehtari2016a].
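The idea can be sketched without the smoothing step: raw importance ratios w = 1/p(y_i | theta_s) reweight the posterior samples toward the leave-one-out posterior. The pure-Python sketch below uses hypothetical data and samples (normal likelihood, sd = 1); PSIS improves on exactly this estimate by replacing the largest ratios with a fitted generalized Pareto tail:

```python
import math

# Raw importance-sampling estimate of loo (the un-smoothed version of PSIS).
# Data and posterior draws are hypothetical toy numbers.
y = [1.8, 2.2, 1.5, 2.9, 2.1]
theta_samples = [1.9, 2.0, 2.1, 2.3, 1.7, 2.05, 2.2, 1.95]

def norm_pdf(x, mu, sd=1.0):
    return math.exp(-0.5 * ((x - mu) / sd) ** 2) / math.sqrt(2 * math.pi * sd**2)

loo = 0.0
for yi in y:
    dens = [norm_pdf(yi, t) for t in theta_samples]
    w = [1.0 / d for d in dens]                   # raw leave-one-out importance ratios
    # weighted estimate of p(y_i | y_{-i}); PSIS would smooth w before this step
    loo += math.log(sum(wi * di for wi, di in zip(w, dens)) / sum(w))
```

The raw ratios can have very heavy tails when a point is influential, which is what the high Pareto k warnings from compare() flag.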

compare( m0, m1, m11, m2, func=PSIS )
## Warning in compare(m0, m1, m11, m2, func = PSIS): Not all model fits of same class.
## This is usually a bad idea, because it implies they were fit by different algorithms.
## Check yourself, before you wreck yourself.
## Some Pareto k values are high (>0.5). Set pointwise=TRUE to inspect individual points.
##         PSIS       SE     dPSIS      dSE     pPSIS weight
## m2  5308.427 43.31697   0.00000       NA 16.578752      1
## m11 5389.889 39.42397  81.46220 19.38428  3.238938      0
## m1  5390.026 39.63549  81.59964 19.22696  2.464753      0
## m0  5412.220 39.88648 103.79364 20.19328  3.151195      0

Future work/other models to consider

  • Gaussian process using time separation to define covariance

Questions? and References

I’m glad to answer any questions while I display the references.